BTS 510 Lab 3

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_classic(base_size = 16))

1 Learning objectives

  • Understand probability as a measure of uncertainty
  • Summarize probabilities with probability distributions
  • Think about uncertainty in sampling
  • Summarize uncertainty in sampling: Sampling distribution

2 Mathematically-derived sampling distribution

  • Many combinations of population distribution and test statistic have mathematically-derived sampling distributions
    • For example, with a normally distributed population (\mathcal{N}(\mu, \sigma^2)), the sampling distribution of the mean is \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
      • In this case, the population distribution and sampling distribution are the same distribution: Normal
      • The variance (or standard deviation) of the sampling distribution decreases as n increases: A larger sample leads to a smaller standard error, and thus a more precise estimate of the mean
Note
  • There is not a mathematically-derived sampling distribution for every population distribution and test statistic
    • If you’re dealing with unknown or unusual distributions or test statistics, your only option may be computational methods
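As a quick numeric illustration of the formula above, here is a small base-R sketch that evaluates the theoretical standard error \sigma / \sqrt{n} for a standard Normal population (the sample sizes match the ones used later in this lab):

```r
# Theoretical standard error of the mean, SE = sigma / sqrt(n),
# for a standard Normal population (sigma = 1)
sigma <- 1
n <- c(10, 50, 100)
round(sigma / sqrt(n), 4)
# [1] 0.3162 0.1414 0.1000
```

Notice that the SE shrinks as n grows, but not linearly: it scales with the square root of n.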

3 Computationally-derived sampling distribution

3.1 Getting started

  • Set a random number seed
    • If you don’t do this, every time you run the code, you’ll get different numbers
    • Setting a random seed makes your simulation replicable
    • (I normally set a seed at the start of each document, but I moved it down here because it’s very important in this lab.)
  • Load (and install, if needed) the infer package
    • infer has a function that we’ll use to repeatedly sample
    • There are other ways to do this, but this is very simple and works well
    • You can install a new package a few different ways
      • Via the Console: install.packages("infer")
      • Via the “Packages” pane (bottom right): Click “Install” button, type in the package name, and click “Install”
  • Install only once; load with library() once per document
set.seed(12345)
library(infer)

3.2 Standard Normal population: population \sim \mathcal{N}(0, 1)

3.2.1 Create a population

  • rnorm() creates a random variable that follows a normal distribution
    • First argument = Number of observations (population size)
    • Second argument = \mu
    • Third argument = \sigma (not \sigma^2)
  • Create a very large population (N = 1,000,000) with a specified mean and variance
    • Confirm that the mean and SD are what you think they should be
norm_pop <- rnorm(1000000, 0, 1)
mean(norm_pop)
[1] 0.0002941277
sd(norm_pop)
[1] 1.001315
var(norm_pop)
[1] 1.002632

3.2.2 Sample from the population

  • Small sample: n = 10
    • rep_sample_n() repeatedly samples from the dataset
      • Piped (%>%) from the norm_pop dataset, so it resamples from that
      • size = 10: Sample size for each sample
      • reps = 10000: Number of repeated samples (could be larger but this is good enough to show how things work and doesn’t take too long to run)
      • replace = TRUE: Sample with replacement, so observations go back in after each sample and can be sampled again in the next sample
    • summarise(x_bar = mean(norm_pop)): Create a data frame containing the mean of each sample
      • There are 10,000 samples, so the data frame will contain 10,000 means
    • The ggplot() code creates a histogram from the data
      • Modification: bins = 50 increases the bins to make it a little smoother
      • Modification: xlim(-1.5, 1.5) sets limits on the x-axis to make the plots more comparable
        • I had to run the plots once without xlim() to see how wide the X axis needed to be. For later plots, do the same: run without it, look at the range, then add xlim() with appropriate limits to put them all on the same scale
      • Make similar modifications in your own plots later in this document
      • We’ll spend the next 2 weeks on ggplot() so you’ll learn all of this (and more!)
norm_means10 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Moderate sample: n = 50
norm_means50 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5) 
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Large sample: n = 100
norm_means100 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.2.3 Compare estimates

  • Compare means for each sample size
mean(norm_means10$x_bar)
[1] -0.003408017
mean(norm_means50$x_bar)
[1] -0.0009473133
mean(norm_means100$x_bar)
[1] 0.001145927
  • Compare standard deviations for each sample size
sd(norm_means10$x_bar)
[1] 0.3131297
sd(norm_means50$x_bar)
[1] 0.1416664
sd(norm_means100$x_bar)
[1] 0.100464
  • Compare variances for each sample size
var(norm_means10$x_bar)
[1] 0.09805018
var(norm_means50$x_bar)
[1] 0.02006937
var(norm_means100$x_bar)
[1] 0.01009301
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
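If you want to check these numbers directly, here is a self-contained base-R sketch that mirrors the workflow above using sample() and replicate() instead of rep_sample_n() (the seed, population size, and number of repetitions here are arbitrary choices, not the lab's):

```r
set.seed(123)
pop <- rnorm(100000, 0, 1)  # smaller population than above, for speed

# For each sample size, draw 2000 repeated samples and compare the
# empirical SE (the SD of the sample means) to the theoretical sigma / sqrt(n)
for (n in c(10, 50, 100)) {
  x_bar <- replicate(2000, mean(sample(pop, n, replace = TRUE)))
  cat(sprintf("n = %3d  empirical SE = %.3f  theoretical SE = %.3f\n",
              n, sd(x_bar), 1 / sqrt(n)))
}
```

The empirical and theoretical columns should agree closely, just as the sd() values above match \sigma / \sqrt{n}.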

3.3 Bernoulli population: population \sim B(0.3)

3.3.1 Create a population

  • rbinom() creates a random variable that follows a binomial distribution
    • First argument = Number of observations (population size)
    • Second argument = Number of trials (here, just 1 for a Bernoulli trial)
    • Third argument = p (probability of success)
  • Create a very large population (N = 1,000,000) with a specified mean and variance
    • Confirm that the mean and SD are what you think they should be
bern_pop <- rbinom(1000000, 1, 0.3)
mean(bern_pop)
[1] 0.300871
sd(bern_pop)
[1] 0.4586369
var(bern_pop)
[1] 0.2103479
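For comparison, the theoretical moments of a Bernoulli(p) population are \mu = p and \sigma^2 = p(1 - p); a quick check against the simulated values above:

```r
# Theoretical mean, variance, and SD of a Bernoulli(p) population
p <- 0.3
c(mean = p, var = p * (1 - p), sd = sqrt(p * (1 - p)))
# mean = 0.3, var = 0.21, sd ~ 0.458 -- close to the simulated values
```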

3.3.2 Sample from the population

Note
  • Remember to adjust the X axis using xlim() (as in the Normal section) to have the same range for each plot.
  • You’ll need to make each plot, see what they look like, and then adjust the X axis accordingly.
  • Small sample: n = 10
bern_means10 <- data.frame(bern_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bern_pop))

ggplot(data = bern_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50)

  • Moderate sample: n = 50
  • Large sample: n = 100

3.3.3 Compare estimates

  • Compare means for each sample size
  • Compare standard deviations for each sample size
  • Compare variances for each sample size
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution for a normally distributed population?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
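As a sketch of what to expect when you compare your estimates: plugging the Bernoulli variance p(1 - p) into the formula above gives the theoretical standard errors for each sample size:

```r
# Expected SE of the mean for a Bernoulli(0.3) population:
# SE = sqrt(p * (1 - p) / n)
p <- 0.3
n <- c(10, 50, 100)
round(sqrt(p * (1 - p) / n), 4)
# [1] 0.1449 0.0648 0.0458
```

Your sd() values for bern_means10, bern_means50, and bern_means100 should land close to these.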

3.4 Poisson population: population \sim Poisson(1.5)

3.4.1 Create a population

  • rpois() creates a random variable that follows a Poisson distribution
    • First argument = Number of observations (population size)
    • Second argument = \lambda (mean and variance)
      • Remember that \lambda is the variance, but we’re going to look at the standard deviation, which is the square root of the variance (\sqrt{\lambda})
pois_pop <- rpois(1000000, 1.5)
mean(pois_pop)
[1] 1.500474
sd(pois_pop)
[1] 1.224257
var(pois_pop)
[1] 1.498805
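For comparison, a Poisson(\lambda) population has mean and variance both equal to \lambda, so the SD is \sqrt{\lambda}; a quick check against the simulated values above:

```r
# Theoretical mean, variance, and SD of a Poisson(lambda) population:
# mean = var = lambda, so sd = sqrt(lambda)
lambda <- 1.5
c(mean = lambda, var = lambda, sd = sqrt(lambda))
# mean = 1.5, var = 1.5, sd = sqrt(1.5) ~ 1.2247
```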

3.4.2 Sample from the population

Note
  • Remember to adjust the X axis using xlim() (as in the Normal section) to have the same range for each plot.
  • You’ll need to make each plot, see what they look like, and then adjust the X axis accordingly.
  • Small sample: n = 10
pois_means10 <- data.frame(pois_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(pois_pop))

ggplot(data = pois_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50)

  • Moderate sample: n = 50
  • Large sample: n = 100

3.4.3 Compare estimates

  • Compare means for each sample size
  • Compare standard deviations for each sample size
  • Compare variances for each sample size
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution for a normally distributed population?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
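As a sketch of what to expect here: plugging the Poisson variance \lambda into the formula above gives the theoretical standard errors for each sample size:

```r
# Expected SE of the mean for a Poisson(1.5) population:
# SE = sqrt(lambda / n)
lambda <- 1.5
n <- c(10, 50, 100)
round(sqrt(lambda / n), 4)
# [1] 0.3873 0.1732 0.1225
```

Your sd() values for the Poisson sample means should land close to these.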

4 Summarize results

  • In general, how does sample size impact the standard error?
    • i.e., the standard deviation of the sampling distribution?
    • Do larger samples lead to smaller or larger standard errors?
  • In general, how does the population distribution impact the sampling distribution?
    • Based on: Visual inspection of the distribution
    • Based on: Estimates of standard error
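One compact way to summarize the sample-size pattern (a sketch, using \sigma = 1 for illustration): because SE = \sigma / \sqrt{n}, quadrupling the sample size halves the standard error, regardless of the population distribution:

```r
# SE scales as 1 / sqrt(n): quadrupling n halves the standard error
sigma <- 1
se <- function(n) sigma / sqrt(n)
se(25)            # 0.2
se(100)           # 0.1
se(25) / se(100)  # 2 -- quadrupling n halved the SE
```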